Designed by
Bochong Chen (bc446) and Jiaxian Chen (jc3459)
Not every household has a scanner at home, and many scanning apps nowadays require active subscriptions. Therefore, we set out to implement a free-to-use document scanner system that can be easily used at home and carried in a pocket. The system focuses on two kinds of functions: document scanning and form filling using Optical Character Recognition (OCR) techniques.
The document scanning mode uses image processing techniques to detect document-like contours in the camera's video frames. When a valid document is found, the user can apply a monochrome filter to it and scan the document. The scanned result is displayed on the piTFT, and if the user is satisfied with it, a single button press saves the image file to a designated Baidu cloud drive.
The other function is to recognize the text in a document. This is helpful when a person needs to record data from forms or receipts. The first time the feature is used, the user needs to provide a sample document and manually mark the Regions of Interest (ROIs). After this configuration, the program can automatically find the document in the viewfinder and perform OCR on each of the ROIs. A successful take is displayed on the piTFT screen, with the recognized characters overlaid on top of the original image. The program can store multiple scans, which are automatically saved as CSV files and pushed to the cloud disk.
The program allows the user to switch between the scanning mode and the OCR mode, take photos, and save results by pressing the four physical buttons located on the right side of the piTFT screen. To perform all the functionalities with only four buttons, we implemented a multi-level menu. The general UI layout looks like the following:
The layout can be abstracted into a set of menu states, and the program determines its operations based on its current state. The states, enumerated by the Menu class in menu.py, are SCAN, SCAN_TAKE, SCAN_TAKEN, OCR, OCR_TAKE, OCR_TAKEN, and OCR_CHOOSE.
The document scanning frame is where all the contours are detected and displayed. To produce it, the program goes through the following pipeline:
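In outline, each preview frame is converted to grayscale, blurred, run through Canny edge detection, dilated and eroded to close broken edges, and then searched for the largest four-corner contour. Below is a condensed sketch of these steps, adapted from get_scanning_frame() and find_biggest() in the Code Appendix (find_document_contour is an illustrative wrapper, not a function in the actual program):

import cv2
import numpy as np

def find_document_contour(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)            # grayscale for edge detection
    blur = cv2.GaussianBlur(gray, (5, 5), 1)                 # suppress high-frequency noise
    edges = cv2.Canny(blur, 100, 150)                        # keep edges only
    kernel = np.ones((5, 5), np.uint8)
    edges = cv2.dilate(edges, kernel, iterations=2)          # thicken and connect broken edges
    edges = cv2.erode(edges, kernel, iterations=1)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    biggest, max_area = np.array([]), 0
    for c in contours:                                        # keep the largest 4-corner contour
        area = cv2.contourArea(c)
        if area > 5000:
            approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
            if len(approx) == 4 and area > max_area:
                biggest, max_area = approx, area
    return biggest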
If the user presses the “Take” button, the program does the following to obtain a scanned document:
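In short, the four detected corners are sorted, mapped onto the full frame with a perspective transform, and optionally binarized with an adaptive threshold for the black-and-white mode. The following is a rough sketch condensed from the get_result branch of get_scanning_frame() in the Code Appendix (scan_document is an illustrative wrapper; sort_corners comes from utils.py, and cv2 and numpy are imported as in the appendix):

def scan_document(img, biggest, thresholding=False):
    # warp the detected quadrilateral onto a flat, full-frame page
    h, w = img.shape[:2]
    pts1 = np.float32(sort_corners(biggest))                  # TL, TR, BL, BR corner order
    pts2 = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
    matrix = cv2.getPerspectiveTransform(pts1, pts2)
    scan = cv2.warpPerspective(img, matrix, (w, h))
    if thresholding:                                          # black-and-white ("monochrome") mode
        gray = cv2.cvtColor(scan, cv2.COLOR_BGR2GRAY)
        scan = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 11, 2)
    return scan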
This action is performed when the user presses “Take” in OCR mode. Unlike the scanning mode, which relies on contour detection, the OCR mode relies on ORB feature matching. ORB (Oriented FAST and Rotated BRIEF) is an image processing algorithm that extracts feature keypoints from an image. Since the user is required to upload a sample picture of the document before using OCR, the program already knows what the target looks like. When a real-world picture is taken, the program can match the ORB keypoints of the captured picture against those of the sample picture. If the match is good enough, the program can directly locate the target document.
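Below is a condensed sketch of this matching step, adapted from get_ocr_frame() in the Code Appendix (it assumes ocr_sample is the user-provided sample image, img is the newly captured frame, and cv2 and numpy are imported as in the appendix):

orb = cv2.ORB_create(1000)                                    # extract up to 1000 keypoints
kp1, des1 = orb.detectAndCompute(ocr_sample, None)            # sample document (known layout)
kp2, des2 = orb.detectAndCompute(img, None)                   # newly captured picture
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = sorted(bf.match(des2, des1), key=lambda m: m.distance)
good = matches[:int(len(matches) * 0.25)]                     # keep only the best 25% of matches
src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
M, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)          # map the capture onto the sample
aligned = cv2.warpPerspective(img, M, (ocr_sample.shape[1], ocr_sample.shape[0]))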
This is considered more robust than the contour-finding method, because feature matching can still function properly when the document does not have all of its corners in view, or when the document is not rectangular.
After the ORB matching, the target document has been located and perspective-transformed, so it is aligned exactly like the sample picture. The program can then crop out each region of interest and run the Tesseract OCR engine on each of them separately.
The OCR pipeline goes as follows:
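The per-field recognition step can be sketched roughly as below, condensed from get_ocr_frame() in the Code Appendix (it assumes aligned is the warped image from the matching step above, ocr_roi is the list of regions loaded from the sample's .pkl configuration, and cv2 and pytesseract are imported as in the appendix):

results = []
for (top_left, bottom_right, field_type, field_name) in ocr_roi:
    crop = aligned[top_left[1]:bottom_right[1], top_left[0]:bottom_right[0]]
    if field_type == 'text':                                  # free-text field: run Tesseract
        text = pytesseract.image_to_string(crop).replace("\x0c", "").replace("\n", "")
        results.append(text)
    elif field_type == 'box':                                 # checkbox field: count dark pixels
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        mask = cv2.threshold(gray, 170, 255, cv2.THRESH_BINARY_INV)[1]
        results.append(cv2.countNonZero(mask) > 0.5 * gray.shape[0] * gray.shape[1])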
After a document scan or an OCR scan, the user can choose to save the results to their designated cloud disk (currently implemented for Baidu cloud disk only). The program uses the bypy library, a tool that allows users to programmatically upload files to Baidu cloud disk. When the user presses the “Save” button, the program generates a jpg image or a CSV file, depending on the operation mode, and uploads it to Baidu cloud disk by running os.system('bypy upload XXX'). A rough sketch of this step is shown below, followed by a sample CSV file uploaded to the cloud disk:
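A minimal sketch of the save-and-upload step for OCR results, condensed from change_state() in the Code Appendix (save_and_upload is an illustrative wrapper, not a function in the actual program):

import csv
import os
from datetime import datetime

def save_and_upload(ocr_roi, my_data):
    time = datetime.now()
    filename = (f'saved_data/ocr_document_{time.date()}_'
                f'{time.hour}_{time.minute}_{time.second}.csv')
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([field[3] for field in ocr_roi])      # header row: field names
        writer.writerow(my_data)                              # data row: recognized values
    os.system(f'bypy upload {filename}')                      # push to the configured Baidu disk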
We encountered several problems during development, and through online research as well as trial and error, we managed to resolve most of them and bring a properly functional prototype to the demo.
1. The first issue was video frame acquisition: the camera delivered a lower resolution than we expected. Although we later realized that the resolution could be turned up with cap.set(), the resolution boost came with a significant sacrifice in frame rate. Reading the documentation, we realized that the problem did not originate from the Pi camera hardware but from the receiving side. We wanted to use multiprocessing to save time decoding the frames, but we quickly found out that the Pi camera firmware was not designed to be handled by multiple processes. Although multiprocessing is still possible by sharing the same camera instance among processes, implementing it would have been challenging, and we decided it wasn't worth the effort.
Instead of multiprocessing, we used a multi-threading approach. We created a threaded class that constantly reads frames from VideoCapture and replaces its frame attribute with the newest one. When the main loop asks for a frame, it always gets one instantly, although it may not be the most up-to-date frame. Using this method, we managed to bring the frame rate to an acceptable level while running at 1024 x 768.
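The core of this pattern looks roughly as follows (the full class is listed as video_stream.py in the Code Appendix; this is a simplified sketch):

from threading import Thread
import cv2

class VideoStream:
    def __init__(self):
        self.cap = cv2.VideoCapture(0)
        self.frame = None
        self.stopped = False

    def start(self):
        Thread(target=self.update, daemon=True).start()       # grab frames in the background
        return self

    def update(self):
        while not self.stopped:
            ret, frame = self.cap.read()                       # blocking read happens off the main loop
            if ret:
                self.frame = frame

    def read(self):
        return self.frame                                      # latest frame, possibly slightly stale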
2. Another issue we encountered was the environment setup. Python usually works in project-based, isolated environments, which is typically considered good practice because it prevents version conflicts with other projects on the same computer. However, when running “sudo python3 code.py” (sudo is needed to display on the piTFT), the environment changed because of the sudo privilege. Our bypy and pytesseract libraries failed because of this. In particular, because the pytesseract module depends on Pillow, pip somehow failed to recognize that the module wasn't actually installed.
After some searching, we found that the environment can be preserved by adding the “-E” flag to the sudo command (e.g., sudo -E python3 code.py). With that, we were able to run our system fully on the piTFT screen, and we reinstalled the bypy library in the sudo environment to eliminate the error.
Overall, we managed to stay on schedule for the entire development stage. In our project proposal, we outlined the goal of implementing a system that can scan documents, perform automated form processing, save results to a designated space, and interact with users through a multi-level menu. At the demo session, we demonstrated all of the proposed functionalities.
As discussed above, we achieved what we proposed and created a functional prototype of a portable scanning system. The system can be carried easily, and the file-sharing process is simple and automated. Our initial ideas have proven to be practical in a real-world system.
If we had more time, we would love to add a lighting-control component to the system, so that it could automatically adjust lighting, contrast, exposure, and so on to achieve better results in more diverse environments (e.g., in a dark room). At the current stage, although the ORB algorithm is more powerful than contour detection, it can still fail to identify the target because of discrepancies between the sample and the actual document. We look forward to researching ways to improve the accuracy and artifact rejection of this approach.
jc3459@cornell.edu
Designed the scanning section.
bc446@cornell.edu
Designed the OCR section.
final.py:
# final.py
import csv
import os
import pickle
import sys
from datetime import datetime
import RPi.GPIO as GPIO
import cv2
import numpy as np
import pygame
import pytesseract
from menu import Menu
from utils import find_biggest, sort_corners, draw_rectangle
from video_stream import VideoStream
# global variables
btn_lock = False
pressed_btn = None
curr_state = Menu.SCAN
menu_lst = [['OCR', 'B & W', 'Take', 'Quit'],
[' ', 'Back', 'Save', 'Quit'],
['Scan', 'Choose', 'Take', 'Quit'],
['Prev', 'Next', 'Apply', 'Quit']]
curr_menu = menu_lst[0]
thresholding = False
take_done = False
main_frame = np.zeros((240, 320, 3))
stream = VideoStream()
# find all the available OCR samples
ocr_filename = []
for filepath in os.listdir("ocr_data"):
if filepath.endswith('.pkl'):
ocr_filename.append(filepath.replace(".pkl", ""))
ocr_index = 0
my_data = []
# pre-load the ORB keypoint samples to save time
ocr_sample = cv2.imread(f'ocr_data/{ocr_filename[ocr_index]}.jpg')
ocr_roi = pickle.load(open(f'ocr_data/{ocr_filename[ocr_index]}.pkl', 'rb'))
orb = cv2.ORB_create(1000)
kp1, des1 = orb.detectAndCompute(ocr_sample, None)
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
# convert an OpenCV frame to a pygame-compatible surface
def cvt2pygame(frame):
frame = cv2.cvtColor(np.float32(frame), cv2.COLOR_BGR2RGB)
frame = cv2.resize(frame, (300, 240)) # resize to screen size
frame = cv2.flip(frame, 1) # horizontal flip
frame = cv2.rotate(frame, cv2.cv2.ROTATE_90_COUNTERCLOCKWISE)
return pygame.surfarray.make_surface(frame)
# interrupt callback function upon button pressed
def button_press_callback(channel):
global btn_lock, pressed_btn
if not btn_lock:
pressed_btn = channel
# change menu state according to the pressed button
def change_state():
global pressed_btn, curr_state, curr_menu, thresholding, \
take_done, my_data, ocr_sample, ocr_roi, ocr_index, ocr_filename
if curr_state == Menu.SCAN:
if pressed_btn == 22:
curr_state = Menu.SCAN_TAKE
elif pressed_btn == 23:
thresholding = not thresholding
elif pressed_btn == 27:
# change back to color mode before going to OCR mode
thresholding = False
curr_state = Menu.OCR
curr_menu = menu_lst[2]
elif curr_state == Menu.SCAN_TAKE:
if take_done: # check if the take action has already finished
take_done = False
curr_state = Menu.SCAN_TAKEN
curr_menu = menu_lst[1]
elif curr_state == Menu.SCAN_TAKEN:
if pressed_btn == 22:
curr_state = Menu.SCAN
curr_menu = menu_lst[0]
# save scan results
time = datetime.now()
filename = f'saved_data/scanned_document_{str(time.date())}_{time.hour}_{time.minute}_{time.second}.jpg'
cv2.imwrite(filename, main_frame)
os.system(f'bypy upload {filename}')
elif pressed_btn == 23:
curr_state = Menu.SCAN
curr_menu = menu_lst[0]
elif curr_state == Menu.OCR:
if pressed_btn == 22:
curr_state = Menu.OCR_TAKE
elif pressed_btn == 23:
curr_state = Menu.OCR_CHOOSE
curr_menu = menu_lst[3]
elif pressed_btn == 27:
curr_state = Menu.SCAN
curr_menu = menu_lst[0]
elif curr_state == Menu.OCR_TAKE:
if take_done: # check if the take action has already finished
take_done = False
curr_state = Menu.OCR_TAKEN
curr_menu = menu_lst[1]
elif curr_state == Menu.OCR_TAKEN:
if pressed_btn == 22:
curr_state = Menu.OCR
curr_menu = menu_lst[2]
# save OCR results
time = datetime.now()
filename = f'saved_data/ocr_document_{str(time.date())}_{time.hour}_{time.minute}_{time.second}.csv'
with open(filename, 'w', newline='') as csvfile:
csv_writer = csv.writer(csvfile)
csv_writer.writerow([i[3] for i in ocr_roi])
csv_writer.writerow(my_data)
os.system(f'bypy upload {filename}')
elif pressed_btn == 23:
curr_state = Menu.OCR
curr_menu = menu_lst[2]
elif curr_state == Menu.OCR_CHOOSE:
if pressed_btn == 22:
curr_state = Menu.OCR
curr_menu = menu_lst[2]
elif pressed_btn == 23:
# switch the OCR sample to the next available
if ocr_index < len(ocr_filename) - 1:
ocr_index += 1
ocr_sample = cv2.imread(f'ocr_data/{ocr_filename[ocr_index]}.jpg')
ocr_roi = pickle.load(open(f'ocr_data/{ocr_filename[ocr_index]}.pkl', 'rb'))
elif pressed_btn == 27:
# switch the OCR sample to the previous available
if ocr_index > 0:
ocr_index -= 1
ocr_sample = cv2.imread(f'ocr_data/{ocr_filename[ocr_index]}.jpg')
ocr_roi = pickle.load(open(f'ocr_data/{ocr_filename[ocr_index]}.pkl', 'rb'))
# state change is finished, erase pressed_btn
pressed_btn = None
# obtain frame depending on menu state
def get_frame():
global main_frame, curr_state
if curr_state == Menu.SCAN or curr_state == Menu.OCR:
main_frame, success = get_scanning_frame(False)
elif curr_state == Menu.SCAN_TAKE:
main_frame, success = get_scanning_frame(True)
elif curr_state == Menu.OCR_TAKE:
main_frame, myData = get_ocr_frame()
elif curr_state == Menu.SCAN_TAKEN or curr_state == Menu.OCR_TAKEN:
pass
elif curr_state == Menu.OCR_CHOOSE:
main_frame = get_ocr_sample_frame()
# load the sample picture and overlay the fields
def get_ocr_sample_frame():
img_show = ocr_sample.copy()
img_mask = np.zeros_like(img_show)
for x, r in enumerate(ocr_roi):
cv2.rectangle(img_mask, r[0], r[1], (0, 255, 0), cv2.FILLED)
img_show = cv2.addWeighted(img_show, 0.99, img_mask, 0.1, 0) # overlay ROI rectangles
cv2.putText(img_show, r[3], r[0], cv2.FONT_HERSHEY_PLAIN, 3, (0, 0, 255), 3) # overlay field text
return img_show
# acquire a scanning preview or result frame
def get_scanning_frame(get_result=False):
global stream, thresholding, take_done
img = stream.read().copy()
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # turn gray for later edge detection
img_blur = cv2.GaussianBlur(img_gray, (5, 5), 1) # blur to remove high frequency noises
img_threshold = cv2.Canny(img_blur, 100, 150) # get picture with edges only
kernel = np.ones((5, 5))
img_dial = cv2.dilate(img_threshold, kernel, iterations=2)
img_threshold = cv2.erode(img_dial, kernel, iterations=1)
contours, _ = cv2.findContours(img_threshold, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # find closed contours
biggest, maxArea = find_biggest(contours) # find the biggest contour with 4 sides
if get_result and biggest.size != 0: # this means the program wants the scanned result
biggest = sort_corners(biggest)
pts1 = np.float32(biggest)
pts2 = np.float32(
[[0, 0], [img.shape[1], 0], [0, img.shape[0]], [img.shape[1], img.shape[0]]])
matrix = cv2.getPerspectiveTransform(pts1, pts2)
img = cv2.warpPerspective(img, matrix, (img.shape[1], img.shape[0])) # perform perspective transform
if thresholding:
# apply adaptive thresholding
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
take_done = True
return img, True
else: # this means the program just want a preview frame
if thresholding:
# apply adaptive thresholding
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if biggest.size == 0:
cv2.drawContours(img, contours, -1, (0, 255, 0), 5)
else:
biggest = sort_corners(biggest)
img = draw_rectangle(img, biggest, 5)
return img, False
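# locate the document via ORB feature matching, then OCR each configured field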
def get_ocr_frame():
global stream, take_done, ocr_sample, ocr_roi, orb, kp1, des1, bf, my_data
# ORB is still not robust enough
# for now, perform document detection first, then ORB
img, success = get_scanning_frame(True)
while not success:
img, success = get_scanning_frame(True)
img = stream.read().copy()
kp2, des2 = orb.detectAndCompute(img, None)
matches = bf.match(des2, des1)
matches.sort(key=lambda x: x.distance)
good_matches = matches[:int(len(matches) * 0.25)]
src_points = np.float32([kp2[m.queryIdx].pt for m in good_matches]).reshape(-1, 1, 2)
dest_points = np.float32([kp1[m.trainIdx].pt for m in good_matches]).reshape(-1, 1, 2)
M, _ = cv2.findHomography(src_points, dest_points, cv2.RANSAC, 5.0)
img_scan = cv2.warpPerspective(img, M, (ocr_sample.shape[1], ocr_sample.shape[0]))
img_scan = cv2.resize(img_scan, (ocr_sample.shape[1], ocr_sample.shape[0]))
img_show = img_scan.copy()
img_mask = np.zeros_like(img_show)
my_data = []
for x, r in enumerate(ocr_roi):
cv2.rectangle(img_mask, r[0], r[1], (0, 255, 0), cv2.FILLED)
img_show = cv2.addWeighted(img_show, 0.99, img_mask, 0.1, 0)
img_crop = img_scan[r[0][1]: r[1][1], r[0][0]: r[1][0]]
if r[2] == 'text':
my_data.append(pytesseract.image_to_string(img_crop)
.replace("\x0c", "").replace("\n", ""))
elif r[2] == 'box':
imgGray = cv2.cvtColor(img_crop, cv2.COLOR_BGR2GRAY)
imgThreshold = cv2.threshold(imgGray, 170, 255, cv2.THRESH_BINARY_INV)[1]
isChecked = cv2.countNonZero(imgThreshold) > 0.5 * imgGray.shape[0] * imgGray.shape[1]
my_data.append(isChecked)
cv2.putText(img_show, str(my_data[x]), r[0], cv2.FONT_HERSHEY_PLAIN, 3, (0, 0, 255), 3)
print(my_data)
take_done = True
return img_show, my_data
# main loop
def main():
global btn_lock, pressed_btn, curr_state, stream, main_frame, curr_menu
# os.putenv('SDL_VIDEODRIVER', 'fbcon') # Display on piTFT
# os.putenv('SDL_FBDEV', '/dev/fb1') #
# os.putenv('SDL_MOUSEDRV', 'TSLIB') # Track mouse clicks on piTFT
# os.putenv('SDL_MOUSEDEV', '/dev/input/touchscreen')
btn_pins = [17, 22, 23, 27]
GPIO.setmode(GPIO.BCM)
GPIO.setup(btn_pins, GPIO.IN, pull_up_down=GPIO.PUD_UP)
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
text_pos = [(305, 200), (305, 140), (305, 80), (305, 20)]
pygame.init()
pygame.mouse.set_visible(False)
screen = pygame.display.set_mode((320, 240))
screen.fill(BLACK)
font = pygame.font.Font(None, 20)
for btn_pin in btn_pins: # bind button press events
GPIO.add_event_detect(btn_pin, GPIO.FALLING, callback=button_press_callback, bouncetime=300)
stream.start()
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
try:
while True:
for event in pygame.event.get():
if event.type == pygame.QUIT:
sys.exit()
btn_lock = True
if pressed_btn == 17:
sys.exit()
change_state()
btn_lock = False
get_frame()
screen.fill(BLACK)
screen.blit(cvt2pygame(main_frame), (0, 0))
for i in range(4):
text = font.render(curr_menu[i], True, WHITE)
text = pygame.transform.rotate(text, 90)
screen.blit(text, text_pos[i])
pygame.display.flip()
finally:
print("Program finished")
stream.cleanup()
cv2.destroyAllWindows()
pygame.quit()
if __name__ == "__main__":
main()
menu.py:
from enum import Enum, unique
@unique
class Menu(Enum):
SCAN = 1
SCAN_TAKE = 2
SCAN_TAKEN = 3
OCR = 4
OCR_TAKE = 5
OCR_TAKEN = 6
OCR_CHOOSE = 7
utils.py:
import cv2
import numpy as np
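# sort the four contour corners into top-left, top-right, bottom-left, bottom-right order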
def sort_corners(myPoints):
myPoints = myPoints.reshape((4, 2))
myPointsNew = np.zeros((4, 1, 2), dtype=np.int32)
add = myPoints.sum(1)
myPointsNew[0] = myPoints[np.argmin(add)]
myPointsNew[3] = myPoints[np.argmax(add)]
diff = np.diff(myPoints, axis=1)
myPointsNew[1] = myPoints[np.argmin(diff)]
myPointsNew[2] = myPoints[np.argmax(diff)]
return myPointsNew
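# return the largest four-sided contour above a minimum area, along with its area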
def find_biggest(contours):
biggest = np.array([])
max_area = 0
for i in contours:
area = cv2.contourArea(i)
if area > 5000:
peri = cv2.arcLength(i, True)
approx = cv2.approxPolyDP(i, 0.02 * peri, True)
if area > max_area and len(approx) == 4:
biggest = approx
max_area = area
return biggest, max_area
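# draw the detected quadrilateral on the preview frame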
def draw_rectangle(img, biggest, thickness):
cv2.line(img, (biggest[0][0][0], biggest[0][0][1]), (biggest[1][0][0], biggest[1][0][1]), (0, 255, 0), thickness)
cv2.line(img, (biggest[0][0][0], biggest[0][0][1]), (biggest[2][0][0], biggest[2][0][1]), (0, 255, 0), thickness)
cv2.line(img, (biggest[3][0][0], biggest[3][0][1]), (biggest[2][0][0], biggest[2][0][1]), (0, 255, 0), thickness)
cv2.line(img, (biggest[3][0][0], biggest[3][0][1]), (biggest[1][0][0], biggest[1][0][1]), (0, 255, 0), thickness)
return img
video_stream.py:
import cv2.cv2
import numpy as np
from threading import Thread

# threaded camera reader that keeps replacing its frame with the newest capture
class VideoStream:
def __init__(self):
self.cap = cv2.VideoCapture(0)
self.cap.set(cv2.cv2.CAP_PROP_FRAME_WIDTH, 1024)
self.cap.set(cv2.cv2.CAP_PROP_FRAME_HEIGHT, 576)
self.stopped = False
self.frame = np.zeros((576, 1024, 3), np.uint8)
def start(self):
t = Thread(target=self.update, args=())
t.daemon = True
t.start()
return self
def update(self):
while True:
if self.stopped:
return
ret, new_frame = self.cap.read()
if not ret:
self.stop()
return
self.frame = new_frame
def read(self):
return self.frame
def stop(self):
self.stopped = True
def cleanup(self):
self.cap.release()